Abstract

The first order stable spline (SS-1) kernel is used extensively in regularized system identification. In particular, the stable 
spline estimator models the impulse response as a zero-mean Gaussian process whose covariance is given by the SS-1 kernel. In 
this paper, we discuss the maximum entropy properties of this prior. Specifically, we formulate the exact maximum entropy 
problem solved by the SS-1 kernel without Gaussian and uniform sampling assumptions. Under general sampling schemes, we 
also explicitly derive the special structure underlying the SS-1 kernel (e.g., the tridiagonal nature of its inverse) 
and give it a maximum entropy covariance completion interpretation. Along the way, similar maximum entropy properties 
of the Wiener kernel are also given. 
1 Introduction 

A core issue of system identification is the design of model estimators able to suitably balance structure complexity and adherence to experimental data. This is also known as the bias-variance problem in the statistical literature. Traditionally, this problem is tackled by applying the maximum likelihood/prediction error method (ML/PEM), see e.g., [1], together with model order selection criteria, such as AIC, BIC and cross validation. Recently, a different method has been introduced in [2] and further developed in [3,4,5]; see also the recent survey [6]. Its key idea is to address the bias-variance problem via well-designed and tuned regularization. More specifically, the impulse response $h(t)$ is modeled as a zero-mean Gaussian process $h(t) \sim \mathcal{GP}(0, k(t, s; \alpha))$, where $k(t, s; \alpha)$ is the covariance (kernel) function and $\alpha$ is the hyper-parameter vector, see e.g., [7]. The key step is to design a suitable kernel structure which reflects our prior knowledge on the system to be identified, e.g., stability. Once $k(t, s; \alpha)$ is determined, $\alpha$ is tuned by maximizing the marginal likelihood, and then the posterior mean of $h(t)$ is returned as the impulse response estimate. 

Several kernel structures have been proposed, e.g., the stable spline (SS) kernel in [2] and the diagonal and correlated (DC) kernel in [4], which have shown satisfactory performance in extensive simulated case studies. In view of this, it is interesting to investigate how, beyond the empirical evidence, the use of these regularized approaches can be justified by theoretical arguments. Different perspectives can be taken: e.g., deterministic arguments in favor of the SS and DC kernels are developed in [4], while [8] discusses the link to the Brownian bridge process, which suggests that the first order stable spline (SS-1) kernel is a natural description for exponentially decaying impulse responses. In this paper, we will instead work within the Bayesian context, discussing the maximum entropy (MaxEnt) properties of the SS-1 kernel. 

The MaxEnt approach was proposed by Jaynes to derive complete statistical prior distributions from incomplete a priori information [9]. Among all distributions that satisfy some constraints, e.g., in terms of the values taken by a few expectations, the MaxEnt criterion chooses the distribution maximizing the entropy. The justification underlying this choice is that the MaxEnt distribution, subject to available knowledge, is the one that can be realized in the greatest number of ways, see also Jaynes’ Concentration Theorem [9]. A preliminary study on the MaxEnt property of kernels for system identification was developed in [10]. Working in continuous time (CT), the problem was to derive the MaxEnt prior using only information on the smoothness and exponential stability of the impulse response. The arguments in [10] were however quite involved, mainly due to the infinite-dimensional nature of the problem and the fact that the differential entropy rate of a generic CT stochastic process is not well-defined. Another recent contribution is [11] where, under Gaussian and uniform sampling assumptions, it is shown that the SS-1 kernel matrix can be given a MaxEnt covariance completion interpretation [12], which is then exploited to derive its special structure (namely that it admits a tridiagonal inverse with a closed-form representation as well as a factorization). 

In this paper, we study the MaxEnt properties of the discrete-time (DT) SS-1 kernel. We first formulate the MaxEnt problem solved by the DT SS-1 kernel without Gaussian and uniform sampling assumptions. Then, we extend the result of [11] and link it to our former result: under general sampling assumptions, we show that the SS-1 kernel matrix is the solution of a maximum entropy covariance extension problem [12] with band constraints. This results in the well-known tridiagonal structure of the kernel’s inverse, which can also be used for efficient numerical implementations [13], [15, Section 5]. As a byproduct, we discuss the MaxEnt properties of the DT Wiener process and its relation with the tridiagonal structure of the inverse of its kernel. 

2 MaxEnt property of the Wiener and the SS-1 kernels 

Recall that the differential entropy $H(X)$ of a real-valued continuous random variable $X$ is defined as $H(X) = -\int_S p(x) \log p(x)\,dx$, where $p(x)$ is the probability density function of $X$ and $S$ is the support set of $X$. 


In the sequel, the objects mainly considered are real-valued DT stochastic processes defined on an ordered index set $T = \{t_i \mid 0 \le t_i < t_{i+1},\ i = 0, 1, \cdots, \infty\}$. 

A real-valued DT stochastic process $w(t)$ with $t \in T$ is called a white Gaussian noise if $w(t)$ is independent and identically Gaussian distributed with mean $\mathbb{E}(w(t)) = 0$ and variance $\mathbb{V}(w(t)) = c$. 

2.1 DT Wiener process 

The white Gaussian noise has a well-known MaxEnt property. On top of it, we can construct a more complex Gaussian process with a MaxEnt property, which is crucial to derive the MaxEnt property of the SS-1 kernel. 

Lemma 1¹ Construct a Gaussian process $g(t)$: 

$$ g(t_0) = 0 \text{ with } t_0 = 0, \qquad g(t_k) = \sum_{i=1}^{k} w(t_i) \sqrt{t_i - t_{i-1}}, \quad k = 1, 2, \cdots \tag{1} $$

For any $n \in \mathbb{N}$, it is the solution to the MaxEnt problem 

$$ \begin{aligned} \underset{h(t)}{\text{maximize}} \quad & H(h(t_1), h(t_2), \cdots, h(t_n)) \\ \text{subject to} \quad & \mathbb{V}(h(t_i) - h(t_{i-1})) = c\,(t_i - t_{i-1}), \\ & \mathbb{E}(h(t_i)) = 0, \quad i = 1, \cdots, n \end{aligned} \tag{2} $$

where it is assumed that $h(t_0) = 0$ for $t_0 = 0$. 

¹ All proofs can be found in the Appendix. 

The resulting Gaussian process (1) is actually the DT Wiener process because it satisfies $g(t_0) = 0$, $g(t)$ is Gaussian distributed with zero mean, and has independent increments with $g(t_i) - g(t_j) \sim \mathcal{N}(0, c(t_i - t_j))$ for $0 \le t_j < t_i$. It can be verified that the DT Wiener process has zero mean and covariance (kernel) function: 

$$ \text{Wiener:} \quad K^{\text{Wiener}}(t, s; c) = c \min(t, s), \quad t, s \in T \tag{3} $$

2.2 The first order SS kernel 

Based on Lemma 1, we can derive the MaxEnt property for the SS-1 kernel: 

$$ \text{SS-1:} \quad K^{\text{SS-1}}(t, s; \alpha) = c \min(e^{-\beta t}, e^{-\beta s}), \quad \alpha = [c\ \ \beta]^T, \quad c > 0, \ \beta > 0, \quad t, s \in T \tag{4} $$

It is also introduced independently via a deterministic argument in [4] and called the tuned correlated (TC) kernel. It is fair to call (4) the SS-1 kernel here, since the “stable” time transformation involved in deriving the SS-1 kernel plays a key role in the following theorem. 

Theorem 1 Let $w(\cdot)$ be a white Gaussian noise with mean zero and variance $c$. Then the stochastic process 

$$ h^o(t_k) = \sum_{i=k}^{n-1} w(e^{-\beta t_i}) \sqrt{e^{-\beta t_i} - e^{-\beta t_{i+1}}}, \quad k = 0, \cdots, n-1, \qquad h^o(t_n) = 0 \text{ with } t_n = \infty \tag{5} $$

is a Gaussian process with zero mean and the SS-1 kernel (4) as its covariance function, and for any $n \in \mathbb{N}$, it is the solution to the MaxEnt problem 

$$ \begin{aligned} \underset{h(t)}{\text{maximize}} \quad & H(h(t_0), h(t_1), \cdots, h(t_{n-1})) \\ \text{subject to} \quad & \mathbb{V}(h(t_{i+1}) - h(t_i)) = c\,(e^{-\beta t_i} - e^{-\beta t_{i+1}}), \\ & \mathbb{E}(h(t_i)) = 0, \quad i = 0, \cdots, n-1 \end{aligned} \tag{6} $$

where it is assumed that $h(t_n) = 0$ with $t_n = \infty$. 
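The covariance claims attached to constructions (1) and (5) can be checked numerically. The following is a minimal sketch and not part of the original development: the sampling instants `t`, the values of `c` and `beta`, and all helper names are assumptions chosen for illustration. Each construction is written as a matrix acting on a white noise vector of variance $c$, so that its covariance is $c\,BB^T$, and the result is compared with the kernels (3) and (4).

```python
import math

def gram(B, c):
    """Covariance c * B B^T of the vector B w, where w is white noise with variance c."""
    n = len(B)
    return [[c * sum(bi * bj for bi, bj in zip(B[i], B[j])) for j in range(n)]
            for i in range(n)]

def wiener_mixing(t):
    """Row k holds the weights sqrt(t_i - t_{i-1}), i <= k, of construction (1), with t_0 = 0."""
    steps = [math.sqrt(b - a) for a, b in zip([0.0] + t[:-1], t)]
    return [[steps[i] if i <= k else 0.0 for i in range(len(t))] for k in range(len(t))]

def ss1_mixing(t, beta):
    """Row k holds the weights sqrt(e^{-beta t_i} - e^{-beta t_{i+1}}), i >= k, of
    construction (5). The instant after the last sample is infinity, so its exponential is 0."""
    tau = [math.exp(-beta * ti) for ti in t] + [0.0]
    steps = [math.sqrt(tau[i] - tau[i + 1]) for i in range(len(t))]
    return [[steps[i] if i >= k else 0.0 for i in range(len(t))] for k in range(len(t))]

def max_abs_diff(A, B):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

c, beta = 2.0, 0.7                 # assumed hyper-parameter values
t = [0.3, 0.5, 1.4, 2.0, 3.7]      # assumed non-uniform sampling instants

# Kernel (3) and kernel (4) evaluated on the sampling instants.
k_wiener = [[c * min(ti, tj) for tj in t] for ti in t]
k_ss1 = [[c * min(math.exp(-beta * ti), math.exp(-beta * tj)) for tj in t] for ti in t]

err_wiener = max_abs_diff(gram(wiener_mixing(t), c), k_wiener)
err_ss1 = max_abs_diff(gram(ss1_mixing(t, beta), c), k_ss1)
print(err_wiener, err_ss1)         # both are expected to be at rounding-error level
```

The check is deterministic because it compares covariances directly rather than sampling the noise; the non-uniform grid is meant to reflect the general sampling setting of Lemma 1 and Theorem 1.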




Remark 1 In the optimization criteria (2) and (6), if we divide the entropy of the sequence of the stochastic process by $n$ and let $n$ go to $\infty$, then the limit (if it exists) becomes the differential entropy rate of the stochastic process [14]. However, the limit does not exist for the Gaussian processes (1) and (5), which is the reason why the entropy of a finite sequence of the stochastic process is used here instead. 


3 Special structure of Wiener and SS-1 kernels and their MaxEnt interpretation 


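In what follows, we let $c = 1$ and consider the kernel matrix $P$ with dimension $n \ge 3$ defined as 

$$ P_{i,j} = K(t_i, t_j; \alpha), \quad i, j = 1, \cdots, n, \quad t_i, t_j \in T \tag{7} $$

where $P_{i,j}$ denotes the $(i, j)$th element of the matrix $P$ and $K$ is either the Wiener kernel (3) or the SS-1 kernel (4). We find that $P$ has some special structure, e.g., its inverse is tridiagonal and its square root has a closed-form expression. This special structure can be used to improve the stability and efficiency of the implementation solving the marginal likelihood maximization, see e.g., [13, Remark 4.2], [15, Section 5]. 

Proposition 1 Consider the Wiener kernel (3) and the SS-1 kernel (4). Then the following results hold: 

(a) for the Wiener kernel, $\det(P^{\text{Wiener}}) = t_1 \prod_{k=1}^{n-1} (t_{k+1} - t_k)$ and $(P^{\text{Wiener}})^{-1}_{i,j}$ is equal to 

$$ \begin{cases} \dfrac{t_2}{t_1 (t_2 - t_1)}, & i = j = 1, \\ \dfrac{t_{i+1} - t_{i-1}}{(t_i - t_{i-1})(t_{i+1} - t_i)}, & i = j = 2, \cdots, n-1, \\ \dfrac{1}{t_n - t_{n-1}}, & i = j = n, \\ 0, & |i - j| > 1, \\ \dfrac{-1}{\max(t_i, t_j) - \min(t_i, t_j)}, & \text{otherwise}, \end{cases} $$

(b) for the SS-1 kernel, $\det(P^{\text{SS-1}}) = e^{-\beta t_n} \prod_{k=1}^{n-1} (e^{-\beta t_k} - e^{-\beta t_{k+1}})$ and $(P^{\text{SS-1}})^{-1}_{i,j}$ is equal to 

$$ \begin{cases} \dfrac{1}{e^{-\beta t_1} - e^{-\beta t_2}}, & i = j = 1, \\ \dfrac{e^{-\beta t_{i-1}} - e^{-\beta t_{i+1}}}{(e^{-\beta t_{i-1}} - e^{-\beta t_i})(e^{-\beta t_i} - e^{-\beta t_{i+1}})}, & i = j = 2, \cdots, n-1, \\ \dfrac{e^{-\beta t_{n-1}}}{e^{-\beta t_n} (e^{-\beta t_{n-1}} - e^{-\beta t_n})}, & i = j = n, \\ 0, & |i - j| > 1, \\ \dfrac{-1}{e^{-\beta \min\{t_i, t_j\}} - e^{-\beta \max\{t_i, t_j\}}}, & \text{otherwise}. \end{cases} $$

Corollary 1 Consider the Wiener kernel (3) and the SS-1 kernel (4). Then the following results hold: 

(a) for the Wiener kernel, 

$$ (P^{\text{Wiener}})^{-1} = W^T W \tag{8} $$

where $W$ is upper bidiagonal with 

$$ W(i,i) = -\frac{t_{i+1}}{t_i} W(i,i+1) = \sqrt{\frac{t_{i+1}}{t_i (t_{i+1} - t_i)}}, \quad i = 1, \cdots, n-1, \qquad W(n,n) = \sqrt{1/t_n}, $$

(b) for the SS-1 kernel, 

$$ (P^{\text{SS-1}})^{-1} = S^T S \tag{9} $$

where $S$ is upper bidiagonal with 

$$ S(i,i) = -S(i,i+1) = \frac{1}{\sqrt{e^{-\beta t_i} - e^{-\beta t_{i+1}}}}, \quad i = 1, \cdots, n-1, \qquad S(n,n) = \sqrt{\frac{e^{\beta (t_n - t_{n-1})} - 1}{e^{-\beta t_{n-1}} - e^{-\beta t_n}}}. $$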
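The closed forms of Proposition 1(b) and Corollary 1(b) can be checked on a small non-uniform grid. The sketch below is illustrative only: the sampling instants `t`, the value of `beta`, and the helper names are assumptions, and $c = 1$ as in this section. It builds the SS-1 kernel matrix, the tridiagonal inverse, and the bidiagonal factor $S$, then measures how far $P P^{-1}$ is from the identity and how far $S^T S$ is from the closed-form inverse.

```python
import math

def matmul(A, B):
    """Plain matrix product for matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def ss1_matrix(t, beta):
    """Kernel matrix (7) for the SS-1 kernel with c = 1."""
    return [[math.exp(-beta * max(ti, tj)) for tj in t] for ti in t]

def ss1_inverse(t, beta):
    """Closed-form tridiagonal inverse of Proposition 1(b)."""
    n = len(t)
    tau = [math.exp(-beta * ti) for ti in t]
    inv = [[0.0] * n for _ in range(n)]
    inv[0][0] = 1.0 / (tau[0] - tau[1])
    for i in range(1, n - 1):
        inv[i][i] = (tau[i - 1] - tau[i + 1]) / ((tau[i - 1] - tau[i]) * (tau[i] - tau[i + 1]))
    inv[n - 1][n - 1] = tau[n - 2] / (tau[n - 1] * (tau[n - 2] - tau[n - 1]))
    for i in range(n - 1):
        inv[i][i + 1] = inv[i + 1][i] = -1.0 / (tau[i] - tau[i + 1])
    return inv

def ss1_factor(t, beta):
    """Upper bidiagonal factor S of Corollary 1(b)."""
    n = len(t)
    tau = [math.exp(-beta * ti) for ti in t]
    S = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        S[i][i] = 1.0 / math.sqrt(tau[i] - tau[i + 1])
        S[i][i + 1] = -S[i][i]
    S[n - 1][n - 1] = math.sqrt((math.exp(beta * (t[-1] - t[-2])) - 1.0) / (tau[-2] - tau[-1]))
    return S

beta = 0.7                         # assumed decay rate
t = [0.3, 0.5, 1.4, 2.0, 3.7]      # assumed non-uniform sampling instants
n = len(t)
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

P = ss1_matrix(t, beta)
P_inv = ss1_inverse(t, beta)
S = ss1_factor(t, beta)

err_inverse = max_abs_diff(matmul(P, P_inv), identity)    # P times closed-form inverse vs. I
err_factor = max_abs_diff(matmul(transpose(S), S), P_inv) # S^T S vs. closed-form inverse
print(err_inverse, err_factor)
```

Because the inverse is tridiagonal, quantities such as $\theta^T P^{-1} \theta$ can be evaluated in $O(n)$ operations, which is one way the structure can help in the marginal likelihood computations mentioned above.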


Remark 2 From (8) and (9), the decomposition $P = UU^T$ with upper triangular $U$ has a closed-form expression. For the Wiener kernel, $U = W^{-1}$ with $U_{i,j} = (W_{j,j})^{-1} t_i/t_j$ for $i \le j$, $i, j = 1, \cdots, n$. For the SS-1 kernel, $U = S^{-1}$ with $U_{i,j} = (S_{j,j})^{-1}$ for $i \le j$, $i, j = 1, \cdots, n$. 
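The closed-form factors of Remark 2 can be verified in the same spirit. The sketch below is an illustration under assumed values (the grid `t`, the rate `beta`, and the function names are not from the paper): it fills the upper triangular $U$ from the diagonal entries of $W$ and $S$ in Corollary 1 and checks that $UU^T$ reproduces the Wiener and SS-1 kernel matrices with $c = 1$.

```python
import math

def outer_gram(U):
    """Return U U^T for a square matrix U stored as a list of rows."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in U] for ri in U]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def wiener_u(t):
    """U = W^{-1}: entry (i, j) is (t_i / t_j) / W(j, j) for i <= j, zero otherwise."""
    n = len(t)
    w_diag = [math.sqrt(t[j + 1] / (t[j] * (t[j + 1] - t[j]))) for j in range(n - 1)]
    w_diag.append(math.sqrt(1.0 / t[-1]))
    return [[(t[i] / t[j]) / w_diag[j] if i <= j else 0.0 for j in range(n)]
            for i in range(n)]

def ss1_u(t, beta):
    """U = S^{-1}: entry (i, j) is 1 / S(j, j) for i <= j, zero otherwise."""
    n = len(t)
    tau = [math.exp(-beta * ti) for ti in t]
    s_diag = [1.0 / math.sqrt(tau[j] - tau[j + 1]) for j in range(n - 1)]
    s_diag.append(math.sqrt((math.exp(beta * (t[-1] - t[-2])) - 1.0) / (tau[-2] - tau[-1])))
    return [[1.0 / s_diag[j] if i <= j else 0.0 for j in range(n)] for i in range(n)]

beta = 0.7                         # assumed decay rate
t = [0.3, 0.5, 1.4, 2.0, 3.7]      # assumed non-uniform sampling instants, t_1 > 0

p_wiener = [[min(ti, tj) for tj in t] for ti in t]
p_ss1 = [[math.exp(-beta * max(ti, tj)) for tj in t] for ti in t]

err_wiener_u = max_abs_diff(outer_gram(wiener_u(t)), p_wiener)
err_ss1_u = max_abs_diff(outer_gram(ss1_u(t, beta)), p_ss1)
print(err_wiener_u, err_ss1_u)     # both are expected to be at rounding-error level
```

Having $U$ in closed form means a factor of $P$ is available without running a numerical Cholesky routine on a possibly ill-conditioned kernel matrix.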

Remark 3 Recall from e.g., [12] that if $X \sim \mathcal{N}(0, P)$ with $(P^{-1})_{i,j} = 0$, then $X_i$ and $X_j$ are conditionally independent given $X_k$ with $k \ne i, j$, where $X_k$ is the $k$th element of $X$. This means that the Wiener and SS-1 kernels correspond to sparse representations, see e.g., [12] for details and also the proof of Corollary 1. 


3.1 MaxEnt covariance completion 


The fact that the kernel matrices of the Wiener and SS-1 kernels have tridiagonal inverses can be given a MaxEnt covariance completion interpretation. 

Recall that a real symmetric matrix $A$ with dimension $n > m + 1$ is called an $m$-band matrix if $A_{i,j} = 0$ for $|i - j| > m$, and the matrix $M$ is called an extension of $A$ if $M_{i,j} = A_{i,j}$ for $|i - j| \le m$. Moreover, an extension $M$ is called a positive extension of $A$ if $M$ is positive definite. A positive extension $M$ of the $m$-band matrix $A$ is called a band extension of $A$ if $M^{-1}$ is an $m$-band matrix. 


Theorem 2 Define $A \in \mathbb{R}^{n \times n}$ as follows: 

$$ A_{i,j} = \begin{cases} P^{\text{Wiener}}_{i,j} \ (\text{resp. } P^{\text{SS-1}}_{i,j}), & |i - j| \le 1, \\ 0, & |i - j| > 1. \end{cases} \tag{10} $$

Then $P^{\text{Wiener}}$ (resp. $P^{\text{SS-1}}$) is the unique band extension of $A$, and the Gaussian random vector with zero mean and covariance $P^{\text{Wiener}}$ (resp. $P^{\text{SS-1}}$) is the unique solution to the MaxEnt covariance completion problem 

$$ \begin{aligned} \underset{P}{\text{maximize}} \quad & H(X) \\ \text{subject to} \quad & P \text{ is any positive extension of } A \end{aligned} \tag{11} $$

where $X$ is a zero mean random vector with covariance 
matrix $P$. 
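For $n = 3$ the completion problem (11) has a single free entry, which makes the statement easy to probe numerically. The sketch below is an illustration only, with assumed sampling instants, an assumed `beta`, and hypothetical helper names. It fixes the tridiagonal band of the SS-1 kernel matrix, varies the $(1,3)$ entry `x` among Gaussian candidates, and compares the resulting entropies with the one obtained at `x` equal to the SS-1 value. It also evaluates the $(1,3)$ cofactor, which is proportional to the $(1,3)$ entry of the inverse.

```python
import math

def det3(M):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def gaussian_entropy(P):
    """Differential entropy of a zero-mean Gaussian vector with 3 x 3 covariance P."""
    return 0.5 * math.log((2.0 * math.pi * math.e) ** 3 * det3(P))

def cofactor_13(M):
    """Cofactor of the (1,3) entry; the (1,3) entry of the inverse is this divided by det."""
    return M[1][0] * M[2][1] - M[1][1] * M[2][0]

beta = 0.7                                   # assumed decay rate
t = [0.4, 1.0, 2.5]                          # assumed sampling instants
tau = [math.exp(-beta * ti) for ti in t]     # SS-1 kernel values with c = 1

def extension(x):
    """Extension of the 1-band matrix A in (10): only the (1,3) entry x is free."""
    return [[tau[0], tau[1], x],
            [tau[1], tau[1], tau[2]],
            [x,      tau[2], tau[2]]]

# Entropy at the SS-1 completion versus a few perturbed (still positive definite) completions.
h_ss1 = gaussian_entropy(extension(tau[2]))
h_other = [gaussian_entropy(extension(tau[2] + d)) for d in (-0.05, -0.01, 0.01, 0.05)]
zero_entry = cofactor_13(extension(tau[2]))

print(h_ss1, max(h_other), zero_entry)
```

In this small case the determinant is a concave quadratic in `x`, so the maximizer is the point where the $(1,3)$ cofactor vanishes, which is consistent with the band extension having a tridiagonal inverse.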

Remark 4 To the best of our knowledge, for the Wiener kernel (3) the special structure and its MaxEnt interpretation have not been pointed out before. For the SS-1 kernel (4), the result under the uniform sampling assumption is given in [11] and is thus a special case of the results in this paper. 

4 Conclusion 

We have shown that a zero-mean Gaussian process with the first-order stable spline kernel solves a maximum entropy problem with the constraint that the variance of the difference between neighboring impulse response coefficients at $t_i < t_{i+1}$ is proportional to $e^{-\beta t_i} - e^{-\beta t_{i+1}}$, which ultimately decays to zero. Its kernel matrix (and likewise that of the Wiener kernel) solves a maximum entropy covariance completion problem and has special structure, e.g., its inverse is tridiagonal, under general sampling assumptions. Finally, one may wonder if other kernels, e.g., the diagonal correlated kernel, can be given a similar maximum entropy interpretation. The answer is more involved and will be discussed separately, see e.g., [15]. 
Appendix 

Proof of Lemma 1 

First, we recall the well-known MaxEnt property of white Gaussian noise. 

Lemma 2 [14, Burg’s MaxEnt Theorem, page 417] Consider the white Gaussian noise $w(t)$. For any $n \in \mathbb{N}$, it is the solution to the MaxEnt problem: 

$$ \begin{aligned} \underset{r(t)}{\text{maximize}} \quad & H(r(t_0), r(t_1), \cdots, r(t_{n-1})) \\ \text{subject to} \quad & \mathbb{E}(r(t_i)) = 0, \ \mathbb{V}(r(t_i)) = c, \quad i = 0, \cdots, n-1 \end{aligned} \tag{12} $$

Then from (2), define $v(t_i) = \dfrac{h(t_i) - h(t_{i-1})}{\sqrt{t_i - t_{i-1}}}$, $i = 1, \cdots, n$. We have $\mathbb{E}(v(t_i)) = 0$, $\mathbb{V}(v(t_i)) = c$, $i = 1, \cdots, n$, and 

$$ h(t_k) = \sum_{i=1}^{k} v(t_i) \sqrt{t_i - t_{i-1}}, \quad k = 1, 2, \cdots, n \tag{13} $$

Now let $L = [h(t_1)\ h(t_2)\ \cdots\ h(t_n)]^T$, $V = [v(t_1)\ v(t_2)\ \cdots\ v(t_n)]^T$, and let $B$ be the lower-triangular matrix with $B_{i,j} = \sqrt{t_j - t_{j-1}}$ for $i \ge j$. Then we have $L = BV$. Apparently, $B$ is nonsingular since all its main diagonal elements are strictly positive. Further noting the property (see e.g., [14, Corollary to Theorem 8.6.4]) that $H(L) = H(V) + \log\det(B)$ yields that the MaxEnt problem (2) is equivalent to 

$$ \begin{aligned} \underset{v(t)}{\text{maximize}} \quad & H(v(t_1), v(t_2), \cdots, v(t_n)) + \log\det(B) \\ \text{subject to} \quad & \mathbb{E}(v(t_i)) = 0, \ \mathbb{V}(v(t_i)) = c, \quad i = 1, \cdots, n \end{aligned} \tag{14} $$

Since the matrix $B$ is independent of $v(t)$ or $h(t)$, the maximum entropy problem (14) is further equivalent to (12). As a result, the optimal $v(t)$ to (14) is the white Gaussian noise $w(t)$. Finally, comparing (13) with (1) yields that the constructed Gaussian process $g(t)$ in (1) is indeed the optimal solution to (2). 
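The change-of-variables identity $H(L) = H(V) + \log\det(B)$ used in this proof holds for any distribution of $V$; it can be illustrated in the Gaussian case, where the entropy has a closed form. The sketch below is an assumption-laden illustration (the grid `t`, the variance `c`, and the helper names are chosen here, not taken from the paper): it takes $V$ white Gaussian with variance $c$, forms the covariance of $L = BV$, and compares the two sides of the identity.

```python
import math

def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [row[:] for row in M]
    n = len(A)
    d = 1.0
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(A[r][k]))
        if A[p][k] == 0.0:
            return 0.0
        if p != k:
            A[k], A[p] = A[p], A[k]
            d = -d                      # a row swap flips the sign
        d *= A[k][k]
        for r in range(k + 1, n):
            f = A[r][k] / A[k][k]
            for col in range(k, n):
                A[r][col] -= f * A[k][col]
    return d

def gaussian_entropy(cov):
    """Differential entropy of a zero-mean Gaussian vector with covariance cov."""
    m = len(cov)
    return 0.5 * (m * math.log(2.0 * math.pi * math.e) + math.log(det(cov)))

c = 2.0                            # assumed noise variance
t = [0.3, 0.5, 1.4, 2.0, 3.7]      # assumed sampling instants t_1, ..., t_n (t_0 = 0)
n = len(t)

# B(i, j) = sqrt(t_j - t_{j-1}) for i >= j, as in the proof of Lemma 1.
steps = [math.sqrt(b - a) for a, b in zip([0.0] + t[:-1], t)]
B = [[steps[j] if i >= j else 0.0 for j in range(n)] for i in range(n)]

cov_V = [[c if i == j else 0.0 for j in range(n)] for i in range(n)]
cov_L = [[c * sum(B[i][k] * B[j][k] for k in range(n)) for j in range(n)] for i in range(n)]

gap = gaussian_entropy(cov_L) - gaussian_entropy(cov_V) - math.log(det(B))
print(gap)                         # expected to be zero up to rounding error
```

Since $\log\det(B)$ depends only on the sampling instants, adding it to the objective does not change the maximizer, which is the step that reduces (14) to (12).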

Proof of Theorem 1 

We first introduce a time transformation, and define 

$$ \tau_i = e^{-\beta t_{n-i}}, \tag{15} $$

$$ f(\tau_i) = h(-\log(\tau_i)/\beta), \quad i = 0, \cdots, n. \tag{16} $$

Then the MaxEnt problem (6) is equivalent to 

$$ \begin{aligned} \underset{f(\tau)}{\text{maximize}} \quad & H(f(\tau_1), f(\tau_2), \cdots, f(\tau_n)) \\ \text{subject to} \quad & \mathbb{V}(f(\tau_i) - f(\tau_{i-1})) = c\,(\tau_i - \tau_{i-1}), \\ & \mathbb{E}(f(\tau_i)) = 0, \quad i = 1, \cdots, n \end{aligned} \tag{17} $$

where it is assumed that $f(\tau_0) = 0$ with $\tau_0 = 0$. By Lemma 1, the optimal solution to (17) is the Gaussian process $g(\tau)$ defined as follows: 

$$ g(\tau_0) = 0 \text{ with } \tau_0 = 0, \qquad g(\tau_k) = \sum_{i=1}^{k} w(\tau_i) \sqrt{\tau_i - \tau_{i-1}}, \quad k = 1, 2, \cdots \tag{18} $$

where $w(\tau)$ is the white Gaussian noise defined on $\{\tau_0, \tau_1, \cdots\}$. Finally, noting (16) and (15) yields that the optimal solution to (6) is (5). Apparently, (5) is a Gaussian process with zero mean and the SS-1 kernel as its covariance function. This completes the proof. 

Proof of Proposition 1 

For the results hereafter, we only give the proof for the SS-1 kernel; the proof for the Wiener kernel can be derived in the same way and is thus omitted. 

From (5), define $x = [x_1, \cdots, x_n]^T$ with $x_k = h^o(t_k) - h^o(t_{k-1})$, $k = 1, \cdots, n$. Then we have $x \sim \mathcal{N}(0, Q)$ where $Q$ is a diagonal matrix with $Q_{i,i} = e^{-\beta t_{i-1}} - e^{-\beta t_i}$, $i = 1, \cdots, n$. Moreover, since $x = V [h^o(t_0)\ \cdots\ h^o(t_{n-1})]^T$, we have $(P^{\text{SS-1}})^{-1} = V^T Q^{-1} V$, where $V$ is an upper bidiagonal matrix with all main diagonal elements equal to $-1$ and the first upper off-diagonal elements equal to $1$. Apparently, $(P^{\text{SS-1}})^{-1}$ takes the form in part (b) (with the sampling instants relabeled accordingly), which completes the proof. 

Proof of Corollary 1 

By completing the squares, $\theta^T (P^{\text{SS-1}})^{-1} \theta = \sum_{k=1}^{n-1} S_{k,k}^2 (\theta_k - \theta_{k+1})^2 + S_{n,n}^2 \theta_n^2$, where $\theta \in \mathbb{R}^n$ and $\theta_k$ is the $k$th element of $\theta$. Then (9) follows immediately. 

Proof of Theorem 2 

We first recall a lemma from band matrix extension problems, which is a result of [16, Theorem 2.1, page 898, Theorem 2.2, page 899, Corollary 1.5, page 945]. 

Lemma 3 [16] Assume that $A$ is an $m$-band matrix with dimension $n > m + 1$ and that the submatrices $[A]_i^{m+i}$, $i = 1, \cdots, n - m$, are positive definite, where $[A]_s^l$ with $s < l$ denotes the submatrix of $A$ from the $s$th row (resp. column) to the $l$th row (resp. column). Then we have: 

(a) $A$ has a unique band extension, denoted by $M$. 

(b) The Gaussian random vector with zero mean and covariance matrix $M$ is the unique solution to the MaxEnt problem 

$$ \begin{aligned} \underset{X}{\text{maximize}} \quad & H(X) \\ \text{subject to} \quad & P \text{ is any positive extension of } A \end{aligned} \tag{19} $$

where $X$ is a zero mean random vector with covariance matrix $P$. 

Apparently, $A$ in (10) is a $1$-band matrix and $[A]_i^{i+1}$, $i = 1, \cdots, n - 1$, are positive definite. This means that the results of Lemma 3 hold for $A$ in (10) and the remaining task is to show $M = P^{\text{SS-1}}$, i.e., the optimal solution $P^{\text{opt}}$ of (19) is $P^{\text{opt}} = P^{\text{SS-1}}$. This task can be accomplished by noting the relation between the problems (19) and (6). Note that the Gaussian process (5) solves the problem (6) and has the SS-1 kernel as its covariance function. Assume $Z \sim \mathcal{N}(0, P)$. Then for $n \ge 3$, the covariance matrix $P^{\text{SS-1}}$ is the optimal solution to 

$$ \begin{aligned} \underset{P}{\text{maximize}} \quad & H(Z) \\ \text{subject to} \quad & P_{i,i} + P_{i+1,i+1} - 2 P_{i,i+1} = e^{-\beta t_i} - e^{-\beta t_{i+1}}, \quad i = 1, \cdots, n - 1, \\ & P \text{ is positive definite} \end{aligned} \tag{20} $$

Also note that the feasible set of (19) is a subset of that of (20), hence $H(X) \le H(Z)$ with $X \sim \mathcal{N}(0, P^{\text{opt}})$ and $Z \sim \mathcal{N}(0, P^{\text{SS-1}})$. Finally, noting that $P^{\text{SS-1}}$ is a positive extension of $A$ and the uniqueness of $P^{\text{opt}}$ yields $P^{\text{opt}} = P^{\text{SS-1}}$. This completes the proof. 